ABSTRACT Cultivation-independent surveys of microbial diversity have revealed many bacterial phyla that lack cultured
representatives. These lineages, referred to as candidate phyla, have been detected across many environments. Here, we deeply
sequenced microbial communities from acetate-stimulated aquifer sediment to recover the complete and essentially complete
genomes of single representatives of the candidate phyla SR1, WWE3, TM7, and OD1. All four of these genomes are very small, 0.7
to 1.2 Mbp, and have large inventories of novel proteins. Additionally, all lack identifiable biosynthetic pathways for several key
metabolites. The SR1 genome uses the UGA codon to encode glycine, and the same codon is very rare in the OD1 genome,
suggesting that the OD1 organism could also transition to alternate coding. Interestingly, the relative abundance of the members of
SR1 increased with the appearance of sulfide in groundwater, a pattern mirrored by a member of the phylum Tenericutes. All
four genomes encode type IV pili, which may be involved in interorganism interaction. On the basis of these results and other
recently published research, metabolic dependence on other organisms may be widely distributed across multiple bacterial
candidate phyla.

IMPORTANCE Few or no genomic sequences exist for members of the numerous bacterial phyla lacking cultivated representatives, 
making it difficult to assess their roles in the environment. This paper presents three complete and one essentially complete ge- 
nomes of members of four candidate phyla, documents consistently small genome size, and predicts metabolic capabilities on 
the basis of gene content. These metagenomic analyses expand our view of a lifestyle apparently common across these candidate 
phyla. 
The current number of bacterial phyla recognized by rRNA
databases is between 63 and 84 (SILVA and GreenGenes,
accessed August 2013), although the true count is almost certainly
higher, with some estimates as high as 100 phyla (N. R. Pace,
personal communication, 2013) (1). A careful examination of
naming conventions across databases and phylogeny allowed the
dereplication of the list of phyla and suggested that there are at
least 38 without cultivated representatives (2); these are referred to
as candidate divisions (3, 4) or candidate phyla (CP). The increase
in the number of CP over the last 10 years can be attributed in part
to the subdivision of one major clade, initially referred to as OP11
(5), that is now recognized to comprise several phyla, including
OP11, OD1, and SR1 (6). Some CP, such as TM7, are relatively
well defined (5), while others, including WWE3 (7) and PER (8),
have only recently been proposed.

CP organisms have been detected by 16S rRNA gene sequenc- 
ing surveys spanning a wide array of environment types, including 
the human oral microbiome (9, 10), the mammalian gut (11), 
bioreactors (7, 12, 13), freshwater lakes (14, 15), hypersaline
microbial mats (1), and deep-sea vents (16). Targeted 16S rRNA gene
primer assays have documented diversity within the CP and as- 
sessed the abundance of these organisms in certain environments, 
especially under anoxic and sulfidic conditions (6, 7, 13-15, 17, 
18). However, the full diversity and roles of CP organisms in the 
environment remain unclear. These questions have motivated the 
use of two cultivation-independent approaches to the sequencing 
of CP genomes, (i) single-cell flow systems coupled to multiple- 
displacement amplification and (ii) metagenomics. Single-cell se- 
quencing has generated genomic information for representatives 
of several CP (10, 19-23), but genomes are typically highly incomplete
and amplification bias affects the sequenced gene copy number (24).
Metagenomic methods have yielded nearly complete and
complete genomic sequences of uncultivated groups in natural 
environments (8, 25-29). However, an impediment to the metag- 
enomic reconstruction of genomes is the relatively low abundance 
of CP cells in environmental samples. Biostimulation is one way to 
alter the profile of a microbial community, enriching for certain 
metabolic capabilities (e.g., see reference 30). For example, acetate 
FIG 1 Complete and nearly complete genomes from four CP (RAAC1 to RAAC4) were reconstructed. (Left) Phylogeny of selected CP based on 16S rRNA genes.
The maximum-likelihood tree shown was constructed from an alignment containing 111 taxa and 1,565 unambiguously aligned positions. Bootstrap values of
>80% are displayed as filled circles (for accession numbers, see Fig. S1 in the supplemental material). Also noted are previously sequenced partial and complete
genomes for related members of the CP for which full-length 16S rRNA gene sequences were available (8, 10, 12, 19, 20, 22). (Right) ESOMs made by using 
differential coverage across the 14 sediment columns confirmed genome binning. 



amendment of groundwater combined with filtering (enriching
for cells <1.2 µm in diameter) at an aquifer in Rifle, CO, recently
led to the successful reconstruction of 49 partial and nearly
complete genomes (8) of members of several CP, including OD1,
OP11, BD1-5, and PER. Such strategies have the additional
advantage that they can potentially illuminate organism responses to
specific stimuli, as well as correlations in organism abundance
patterns.

Here, we applied improved metagenomic methods to Illumina
sequence data sets recovered from a series of biostimulated
sediment communities and assembled complete genomes
corresponding to four CP: SR1, WWE3, TM7, and OD1. We report
genome characteristics, metabolic potential, and abundance
across a range of geochemical conditions and community
compositions. Analysis of these new genomes and several other recently
published genomes from the same CP indicates a surprisingly
consistent life strategy and provides insight into why members of
these phyla remain uncultivated.

RESULTS 

We conducted a time course biostimulation experiment with 
flowthrough sediment columns suspended within an aquifer at 
Rifle, CO. Thirteen columns were pumped with acetate-amended 
groundwater for 13 to 63 days, and individual columns were
sacrificed at points of geochemical interest (K. M. Handley et al.,
unpublished). Metagenomic sequencing of DNA from acetate-amended
column sediment and an unamended background sediment
sample revealed the presence of a variety of bacteria, including
diverse members of the CP (I. Sharon et al., unpublished data).

Genomic sequences of four CP organisms of interest were pres- 
ent in multiple samples. The fragmentation patterns of each ge- 
nome differed across the metagenomic data sets, making it possi- 
ble to identify overlaps in scaffolds from different samples and to 
generate high-quality draft genomes. Differential organism abun- 
dance patterns across the samples were used to confirm that all 
fragments >3 kb in length were correctly assigned to the four 
genomes (Fig. 1). Draft genomes were subsequently curated to 
complete and nearly complete genomes by using paired-end read 
information (see Materials and Methods). The reconstructed
genomes are presented here as RAAC1 to RAAC4 for Rifle acetate
amendment columns 1 to 4. On the basis of phylogenetic analysis,
the RAAC1 genome falls within SR1, RAAC2 falls within WWE3,
a clade sister to OP11 (7), RAAC3 falls into TM7 group 3, and
RAAC4 falls into OD1 (Fig. 1; see Fig. S1 in the supplemental
material). Notably, all four genomes are very small, 0.7 to 1.17 Mb
in length (Table 1), within the range typically seen only in obligate
symbionts (Fig. 2). Only the RAAC4 (OD1) genome is not circu- 
larized. Attempts to circularize this genome by PCR amplification 
failed, despite control reactions demonstrating that other seg- 
ments of the OD1 genome could be amplified from the sediment 
TABLE 1 Genome information for the four CP genomes examined in this study

Parameter                                       RAAC1 (SR1)      RAAC2 (WWE3)     RAAC3 (TM7)      RAAC4 (OD1)
Completeness                                    Circular closed  Circular closed  Circular closed  Not circular, 3 gaps
Length (bp)                                     1,177,760        878,109          845,464          [illegible]
Relative GC content                             0.31             0.43             0.49             0.31
Avg ORF length (bp)                             1,011            926              837              908
Avg intergenic distance (bp)ᵃ                   92               66               72               86
Relative protein coding density                 0.91             0.92             0.91             0.9
No. of tRNAs                                    37               45               43               42
No. of protein-coding genes                     1,059            874              921              687
No. (%) of ORFs with predicted function         596 (56)         618 (71)         701 (76)         504 (73)
No. (%) of ORFs with domain-only predictionᵇ    115 (11)         49 (6)           38 (4)           33 (5)
No. (%) of conserved hypothetical ORFsᶜ         106 (10)         121 (14)         143 (16)         96 (14)

ᵃ Average intergenic distance was calculated by including all nonzero distances between protein- and RNA-coding sequences.
ᵇ Using IPRSCAN software version 4.6, data version 35.0.
ᶜ Based on best-hit annotation to Uniref90 database 2012_9 with sequences from reference 8 removed to simplify post-processing.



column extract. Closely related genomes of similar sizes have been 
recovered from the same site, suggesting that the genome is likely 
to be nearly complete (unpublished data). While each genome 
represents the composite sequence of a population, very little 
strain variation was observed, with the exception of TM7, for 
which several closely-related strains were observed. 
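
The coding densities reported in Table 1 can be roughly cross-checked from the other rows of the same table. The following is a minimal sketch, assuming that coding density is approximately the number of protein-coding genes times the average ORF length divided by the genome length, and that the space-separated length values in the table are base-pair counts. The function name `coding_density` and the dictionary layout are illustrative, not part of the study's pipeline. RAAC4 is omitted because its length is not legible in the table.

```python
# Values transcribed from Table 1 (lengths in bp); RAAC4 is omitted because
# its genome length is not legible in the source table.
genomes = {
    "RAAC1 (SR1)":  {"length": 1_177_760, "genes": 1_059, "avg_orf": 1_011, "reported": 0.91},
    "RAAC2 (WWE3)": {"length": 878_109,   "genes": 874,   "avg_orf": 926,   "reported": 0.92},
    "RAAC3 (TM7)":  {"length": 845_464,   "genes": 921,   "avg_orf": 837,   "reported": 0.91},
}


def coding_density(length_bp, n_genes, avg_orf_bp):
    """Approximate fraction of the genome covered by protein-coding ORFs.

    Assumption: ORFs do not overlap and RNA genes are ignored.
    """
    return n_genes * avg_orf_bp / length_bp


for name, g in genomes.items():
    estimate = coding_density(g["length"], g["genes"], g["avg_orf"])
    print(f"{name}: estimated {estimate:.2f}, reported {g['reported']:.2f}")
```

Under these assumptions the estimates agree with the reported densities to two decimal places, which suggests the table rows are internally consistent.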

A previous study of planktonic cells from the same site recon- 
structed genomes related to those reported here (8), but lack of 
associated 16S rRNA gene sequences complicated classification 
efforts. The current genomes allow phylogenetic resolution via 
16S rRNA and protein trees (Fig. 1; see Fig. S2 in the supplemental 
material). One partial genome, ACD25, previously classified as 
OP11 (8), matches the WWE3 genome (RAAC2) at the species 
level and has an identical protein sequence for the DNA-directed 
RNA polymerase beta subunit (see Fig. S2). On the basis of anal- 
FIG 2 GC content versus genome size for the RAAC1 to RAAC4 genomes
with reference data from NCBI. Biotic relationship categories are as described 
by Giovannoni et al. (49). Open red circles represent the following marine 
bacteria with small genomes (left to right): Pelagibacter ubique (free living), 
Candidatus Atelocyanobacterium thalassa (UCYN-A, a symbiont), and Pro- 
chlorococcus marinus (free living). 



yses reported here, ACD22 and ACD24 are also reassigned to phy- 
lum WWE3. Another genome, ACD80, previously classified as a 
distant BD1-5 relative, is reclassified as belonging to phylum SR1,
in agreement with Campbell et al. (10), and a tentatively binned
scaffold containing a 16S rRNA gene was confirmed as belonging
to that genome (Fig. 1; see Fig. S1) (8).

Time series relative abundance. The four CP genomes exhibit 
distinct enrichment patterns across the 14 samples (Fig. 3). Ace- 
tate stimulation resulted in the early increase and then decline of 
betaproteobacteria, followed by an increase in members of the 
class Clostridia (Fig. 3), a pattern that largely paralleled the shift in
terminal-electron-accepting processes during acetate amendment
(31) (Handley et al., unpublished). The change from iron
reduction to sulfate reduction is reflected in the exponential variation in
the groundwater sulfide concentration across samples (see
Fig. S3A in the supplemental material). The SR1 genome rose in
relative abundance by as much as 500-fold as sulfide
concentrations increased; its abundance also correlated strongly with that of
a member of the phylum Tenericutes (see Fig. S3B). The log
relative abundance of the SR1 genome is directly correlated with log
sulfide concentration until peak sulfate reduction (see Fig. S3A)
(R² = 0.78, P < 0.0005), after which abundance continued to rise
and sulfide levels began to decline. This correlation could be due
to a variety of factors, including metabolic relationships between
the members of SR1 and organisms performing sulfate reduction.
Of the CP genomes, the WWE3 genome appears to be the most 
stably abundant (at 0.13 to 0.58%) from mid-iron reduction 
through sulfate reduction. The TM7 and OD1 genomes had the 
highest relative abundance in samples 3 and 7, before notable 
sulfate reduction occurred (Fig. 3). 
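
The log-log relationship described above for the SR1 genome and sulfide can be illustrated with a short calculation. This is a sketch only: the input values are hypothetical placeholders rather than the measurements behind Fig. S3A, and the helper names `pearson_r` and `log_log_r2` are invented for this example. It computes the squared Pearson correlation between log10 abundance and log10 sulfide concentration, which is one plausible way to obtain an R² of the kind reported.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def log_log_r2(abundance, sulfide):
    """R^2 of log10(abundance) against log10(sulfide); inputs must be > 0."""
    log_sulfide = [math.log10(v) for v in sulfide]
    log_abundance = [math.log10(v) for v in abundance]
    r = pearson_r(log_sulfide, log_abundance)
    return r * r


# Hypothetical values for illustration only (not data from this study).
sulfide_um = [1.0, 3.0, 10.0, 30.0, 100.0, 300.0]
rel_abundance_pct = [0.001, 0.004, 0.008, 0.05, 0.09, 0.5]

print(f"R^2 = {log_log_r2(rel_abundance_pct, sulfide_um):.2f}")
```

Working in log space means a power-law relationship between abundance and sulfide appears as a straight line, so a high R² here indicates proportional fold changes rather than equal absolute increments.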

Novel proteins and metabolic characterization. On the basis 
of annotations, each of the four CP genomes contains a large frac- 
tion of proteins lacking functional predictions and proteins whose 
only homologs are hypothetical proteins (Table 1) (8, 10, 12, 19, 
20). When the four genomes were searched against hidden 
Markov models (HMMs) representing all known protein families 
(called "sifting families" or "SFams") (see reference 32), between 
39% (OD1) and 53% (SR1) of the CP sequences were not matched
to any family (see Fig. S4 in the supplemental material). Attempts 
to cluster novel proteins in order to identify previously unknown 



September/October 2013 Volume 4 Issue 5 e00708-13 



mBio mbio.asm.org



Kantor et al. 



[FIG 3 graphic. Left panels: RAAC1 (SR1), RAAC2 (WWE3), RAAC3 (TM7), and RAAC4 (OD1), each plotted against sample number (1 to 14). Right panels: community composition by sample number, with legend classes Betaproteobacteria, Gammaproteobacteria, Deltaproteobacteria, Bacilli, Clostridia, Anaerolineae, Sphingobacteria, Bacteroidetes incertae sedis, Flavobacteria, Holophagae, and Others (<5%).]

FIG 3 The relative abundance of the RAAC1 to RAAC4 genomes varies across samples representing a range of geochemical conditions and microbial 
community compositions. (Left) Percent relative abundance (y axis) across the 14 independent sediment columns (x axis). All four genomes were found at <1% 
relative abundance in every sample. (Right) Endpoint ferrous iron and sulfate concentrations measured in column effluent and community composition. (A 
discussion will appear elsewhere [Handley et al., unpublished].) Sample 1 represents unamended background sediment, and the dashed line roughly divides 
samples from columns undergoing iron reduction (samples 2 to 8) versus sulfate reduction (samples 9 to 14). Samples 13 and 14 were taken after peak sulfate 
reduction had occurred. 



protein families were largely unsuccessful, likely because of the 
wide phylogenetic distances among the four genomes (data not 
shown). 

Analyses by using KAAS (KEGG Automatic Annotation Server 
[33]), independent BLAST searches against the CP genomes, and 
gene-by-gene manual analysis for all genomes were performed to 
assess the completeness of central metabolic pathways. Consistent 
with previous analyses of other bacteria from these CP (8, 10, 12, 
20), all of the genomes examined support fermentative metabo- 
lisms (Fig. 4) and lack tricarboxylic acid cycle genes. Superoxide 
dismutase and alkyl hydroperoxide reductase genes are present, 
presumably involved in resistance to oxidative damage. The four 
genomes encode distinct fermentation pathways, different abili- 
ties to use and store complex carbon, different electron-carrying 
proteins, and some key unique pathways, although consistency 
was observed within the individual CP (see Fig. S5 in the supple- 
mental material). 

RAAC1 (SR1). The SR1 genome lacks genes involved in the
initial steps of the Embden-Meyerhof-Parnas (EMP), pentose
phosphate, and Entner-Doudoroff pathways, consistent with our
analysis of ACD80 and SR1-OR1 (8, 10). In RAAC1, genes were
identified that encode triose phosphate isomerase and the lower
portion of the EMP pathway, from the conversion of
glyceraldehyde-3-phosphate to 3-phosphoglycerate by a possible
nonphosphorylating glyceraldehyde-3-phosphate dehydrogenase
(GapN) through the formation of pyruvate (Fig. 4A; see Fig. S5 in
the supplemental material). The other two SR1 genomes, ACD80
and SR1-OR1, do not appear to contain these genes. The potential
for gluconeogenesis is indicated by the presence of pyruvate
phosphate dikinase, but the other gene involved in this pathway,
fructose 1,6-bisphosphatase, was not identified in any SR1 genome.
The RAAC1 genome possesses a gene cluster responsible for the
fermentation of pyruvate to acetate and formate via pyruvate
formate lyase (PFL), an alternative phosphotransacetylase (PduL)
(34), acetate kinase, and a putative formate transporter. This
pathway is also found in SR1-OR1, though not in a gene cluster. The
organism may also degrade complex carbon by using a dockerin-like
protein, a cellulosome anchoring protein, and several cell
surface glucanases and pectin lyases, as observed in the other SR1
genomes.

Also as reported previously for the two partially reconstructed
SR1 genomes (8, 10), the SR1 genome presented here harbors a
type II/III intermediate form of ribulose 1,5-bisphosphate
carboxylase/oxygenase (RubisCO). This type of RubisCO is implicated in
the AMP salvage pathway (35) identified in RAAC1 (Fig. 4A) and
both of the other SR1 genomes sequenced to date. RAAC1
FIG 4 Cell diagrams depicting central carbon metabolism, proteins putatively involved in electron transfer, and transporters in the CP genomes. Panels: A,
RubisCO possesses conserved catalytic residues and an additional
29-amino-acid sequence unique to this form of RubisCO.
The sequence identity across all three SR1 RubisCOs is high
(72%), suggesting that they play similar roles in the metabolism
of these organisms. As suggested by Sato et al. (35), the
3-phosphoglycerate produced from the AMP salvage pathway
may enter glycolysis for energy generation.

Lastly, the RAAC1 SR1 genome encodes several electron-carrying
proteins, including a cytochrome b5, three Fe-S cluster
proteins of unknown function, a ferredoxin reductase-like
protein, three flavodoxins, and a rubrerythrin. Some of these may be
important for the response to oxidative stress and/or reoxidation
of reduced ferredoxin or NADH.

RAAC2 (WWE3). The WWE3 genome possesses most of the 
genes for the EMP pathway, pyruvate ferredoxin oxidoreductase 
(PFOR, or possibly 2-oxoglutarate ferredoxin oxidoreductase 
subunits alpha and beta), which converts pyruvate to acetyl coen- 
zyme A (acetyl-CoA), and an acetyl-CoA synthetase (ADP form- 
ing, EC 6.2.1.13) that produces acetate and ATP (Fig. 4B). There is 
also a gene cluster for the lower portion of the pentose phosphate
pathway, converting ribulose-5-phosphate to glyceraldehyde-3-phosphate.
Located in the same gene cluster is a protein identified
as belonging to the D-isomer-specific 2-hydroxy-acid dehydroge- 
nases, a superfamily that contains glyoxylate reductase and 
D-lactate dehydrogenase (36). The presence of phosphoenolpyru- 
vate synthase suggests that the WWE3 organism may engage in 
gluconeogenesis, but it lacks fructose 1,6-bisphosphatase. These 
observations are consistent with the analysis of the newly reclassi- 
fied WWE3, ACD22, ACD24, and ACD25 genomes (see Fig. S5 in 
the supplemental material), although these genomes are frag- 
mented and incomplete and may be subject to binning error. The 
RAAC2 WWE3 organism may also be capable of synthesizing and 
utilizing glycogen, a common energy storage polysaccharide, via 
an operon containing a predicted galactose-1-phosphate uridylyltransferase,
two glycogen synthase genes, an alpha-amylase-like
gene, and a glycosyl hydrolase of family 57, the family containing 
the branching enzyme required to form glycogen. 

The ATP synthase operon in the WWE3 genome contains
alpha, beta, gamma, and delta/epsilon subunits; however, the
adjacently located a, b, and c subunits lack significant homology to
known ATPase subunits. Instead, the putative c subunit,
responsible for ion translocation, bears homology to a similar
transmembrane protein found in the ATPase operons of the OP11 and
WWE3 genomes, ACD38 and ACD24 (8). Given the lack of
homology to characterized ATPases (37), it remains unclear whether
the ATPase, if functional, operates primarily in the forward or
reverse direction and whether it pumps protons or sodium ions.
This genome also contains a membrane-bound proton/sodium-pumping
pyrophosphatase, which could be used to generate
membrane potential. Electron carriers in the WWE3 organism
include a ferredoxin oxidoreductase-like protein, a cytochrome
b5, and a plastocyanin-like blue copper protein encoded in the
genome adjacent to a predicted membrane-associated ferric
reductase-like protein (Fig. 4B). Additional electron flow may
proceed through a type 3B-like cytoplasmic nickel-dependent
hydrogenase homologous to those found in another WWE3
genome, ACD22, and several OD1 genomes (8).

RAAC3 (TM7). The pentose phosphate pathway and most of 
the EMP pathway are present in the TM7 genome (Fig. 4C), but 
enolase was not detected, nor was a means for converting pyruvate 
to acetyl-CoA (e.g., PFOR, PFL, or pyruvate dehydrogenase) or 
for using acetyl-CoA. We confirmed that enolase is not annotated 
or identifiable in any of the currently available TM7 genome se- 
quences (12, 19, 20) (see Fig. S5 in the supplemental material). 
RAAC3 contains phosphoketolase but not acetate kinase, needed 
for the generation of ATP from this branch off the pentose phos- 
phate pathway (Fig. 4C, box 19). Although both genes were iden- 
tified in a gene cluster in a related genome (12), synteny between 
the two genomes in this region is lacking. Other possible routes for 
the fermentation and regeneration of NAD+ by the RAAC3 TM7
organism include two dehydrogenases related to D-lactate
dehydrogenase, which are conserved in other TM7 genomes (12).
Malate/lactate dehydrogenase, which has been reported in
another TM7 genome (12), was not identified in RAAC3. As an
alternative means of producing ATP, RAAC3 encodes the arginine
deiminase pathway (Fig. 4C). We identified genes in this pathway
in other TM7 genome sequences (12, 19), suggesting that it may be
common to members of the TM7 phylum.

The TM7 (RAAC3) genome has several genes involved in
complex carbon degradation, including beta-glucosidase, a predicted
secreted endo-1,3(4)-β-glucanase, α-amylase, and a glycogen
phosphorylase-like protein. The presence of a bifunctional
trehalose synthase/phosphatase indicates the use of trehalose, which is
synthesized from glucose-6-phosphate and UDP-glucose.
Alpha,alpha-trehalase is also present, providing for the
subsequent degradation of trehalose and release of glucose for cellular
consumption.

The RAAC3 TM7 genome contains a complete ubiquinol
oxidase (cytochrome bo) operon, with intact functional residues, as
well as residues known to distinguish cytochrome o ubiquinol
oxidases from their closely related counterparts, cytochrome c
oxidases (38). This complex could be used for oxygen scavenging,
as all other information points to a fermentation-based
metabolism and we did not find the complex in other TM7 genomes. The
electron source for ubiquinol oxidase may be a single-subunit
form of NADH:ubiquinone dehydrogenase (NDH) most similar
to type II NDH in structural searches (39, 40). This NDH is
located adjacent to the ubiquinol oxidase operon in the RAAC3
genome and has homologs in other TM7 genomes (12, 19).

RAAC4 (OD1). OD1 is a very diverse CP, so it is unsurprising 
that variation in metabolic capabilities exists. Like TM7, the 
RAAC4 OD1 genome contains genes for the pentose phosphate 
pathway, as does the recently sequenced AAA011-A08 genome 
(22). However, the AAA255-P19 OD1 genome from the same 
work does not contain this pathway and two other partial OD1 
genomes examined support only the latter half of the pathway (see 
Fig. S5 in the supplemental material). RAAC4 possesses a modi- 
fied EMP pathway from mannose-6-phosphate isomerase to py- 
ruvate (Fig. 4D). Also similar to TM7, and consistent with other 
OD1 members analyzed here, it does not contain any identifiable 
PFL, PFOR, or pyruvate dehydrogenase genes or genes for the 
utilization of acetyl-CoA. Like the WWE3 organism, the OD1 or- 
ganism may be capable of fermentation to lactate, as indicated by 
a 2-hydroxy-acid dehydrogenase family gene found adjacent to 
enolase and pyruvate kinase in the genome. The synteny of the 
genes in this cluster is not conserved in the other OD1 genomes 
examined. A predicted lactate transporter is encoded elsewhere in 
the RAAC4 genome. The presence of a putative cellulosome- 
related gene cluster, two glycosyl hydrolases, and an end-specific 
cellobiohydrolase suggests that OD1 is capable of complex carbon
degradation. The RAAC4 OD1 genome also contains a gene clus- 
ter involved in alternative polyamine biosynthesis, as described by 
Lee et al. (41), which may be important in biofilm formation. 

The RAAC4 OD1 genome may use a membrane-bound 
sodium/proton-pumping pyrophosphatase to generate a proton 
motive force. Electron transport proteins include a rubredoxin 
and flavoprotein of unknown function, perhaps involved in oxy- 
genic stress tolerance. While other, previously described, OD1 ge- 
nomes were found to contain putative nickel-iron hydrogenases 
involved in uptake and hydrogen production, no hydrogenases
could be identified in RAAC4. 

Essential biosynthetic pathways. While the RAAC genomes 
have easily identifiable enzymes for the interconversion of amino 
acids and nucleotides (e.g., serine hydroxymethyltransferase [with 
the exception of WWE3] and dCTP- or dCMP-deaminase), com- 
plete biosynthesis pathways for nucleotides (see Table S2 and 
Fig. S6 in the supplemental material), lipids, and most amino acids 
(42) could not be identified in any of the four genomes on the basis 
of annotations or by using KAAS (33). On the basis of this prelim- 
inary analysis, we concluded that these organisms may be auxo- 
trophic for many essential metabolites or may contain novel bio- 
synthetic pathways. 

Further analysis was performed to assess the completeness of
nucleotide biosynthesis by using reference sets of genes from
diverse genomes to query RAAC genomes (see Materials and
Methods). The members of SR1 appear to be missing most of this
pathway. If novel genes present in this phylum could form parts of this
pathway, we might expect them to be conserved and possibly
located near identifiable genes involved in nucleotide biosynthesis.
However, few biosynthesis genes are present in all three of the SR1
genomes and the regions containing these genes lack synteny,
offering no clues to novel conserved genes that may perform the
missing functions. The possibility remains that such genes are
found elsewhere in the genomes. Similarly, the WWE3 and TM7
genomes show few genes involved in nucleotide biosynthesis and
lack synteny surrounding these genes.

Some sequenced representatives of OD1 may have functional
pathways for nucleotide biosynthesis (e.g., single-cell genome 
AAA255-P19) (22), but RAAC4 (OD1) does not (see Fig. S6). 
Gene loss or horizontal gene transfer could explain these differ- 
ences in metabolic potential between the members of OD1. 
Whereas the AAA255-P19 genome contains pyrimidine and pu- 
rine biosynthesis genes in distinct operons, genes that correspond 
to these pathways in the other CP genomes are not found in clus- 
ters (see Table S2). 

Given the lack of complete pathways for the biosynthesis of 
some essential metabolites, we examined the possibility that the 
RAAC CP organisms could scavenge these compounds from their 
surroundings. The RAAC genomes contain numerous nucleases 
and proteases, as well as several transporters, whose substrates are 
unknown. 

Cell surface and environmental interactions. None of the
RAAC1 to RAAC4 CP organisms appear to make lipid A or
lipopolysaccharide, as indicated by the absence of genes for
biosynthesis, including lpxC and kdsA (43), and as suggested by previous
work on another TM7 genome (12). All of the genomes except the
WWE3 genome contain complete, identifiable pathways for
peptidoglycan synthesis (see Fig. S7A in the supplemental material).
The abundance of glycosyltransferases in all four genomes,
particularly the WWE3 genome, suggests that the organisms devote
significant energy to the production of polysaccharides,
glycoproteins, and/or a glycosylated S-layer. Additionally, the SR1 and
WWE3 genomes contain genes for the synthesis of dTDP-rhamnose
(see Fig. S7A). The genomes encode proteins containing
one or more of the following domains: concanavalins/lectins,
pectin lyases, fibronectin III, beta propeller, and polycystic kidney
disease, some of which are predicted to be cell surface domains in
Bacteria and Archaea (see Fig. S7B) (44). Many of these predicted
proteins are large, up to 5,900 amino acids, and some have signal
peptides or sortase motifs suggesting possible cell wall
localization.

Sortases, which covalently attach surface proteins to the cell 
wall of Gram-positive bacteria, are present in the SR1, WWE3, and
TM7 genomes, and predicted sorted proteins were found in 
WWE3 and TM7. Each of the four genomes encodes the required 
components for type IV pilus biosynthesis, including pilT (45), for 
twitching motility, and for several predicted pilins (see Fig. S7B) 
(46). The TM7 genome has multiple type II and IV pilus-related 
gene clusters and additional pilins, totaling 60 genes, a full 6% of 
the TM7 genome. Importantly, the pili present in these CP ge- 
nomes are not related to the sortase-associated pili more com- 
monly found in Gram-positive bacteria (47). Rather, they are ho- 
mologous to type IV pili sometimes involved in the uptake of 
environmental DNA (48). Consistent with this possible function, 
each genome contains at least one copy of ComEC, the DNA- 
specific pore-forming protein required for competence, and DNA 
protection protein DprA (see Fig. S7B). 

Translation and coding. Codon usage in both the SRI and 
OD1 genomes is skewed toward low-GC codons, as expected 
given the low GC content across these genomes (see Fig. S8 in the 
supplemental material). Additionally, the SRI genome uses alter- 
nate coding, as reported for another SRI genome (10). This ge- 
nome, RAAC1, and the previously reported SRI genomes (8, 10) 
contain nearly identical genes for tRNA UCA , suggesting that the 
corresponding codon, UGA, is not read as termination but rather 
as an amino acid. Concordantly, open reading frame (ORF) de- 
tection with code 1 1 (bacterial) yielded extremely short sequences 
and an unreasonably high frequency of split genes (a total of 2,288 
predicted ORFs with an average length of 289 bp), while transla- 
tion with code 4 (UGA read as tryptophan) gave typical ORF sizes 
and complete genes (Table 1). However, unlike code 4, where 
UGA encodes tryptophan, conserved positions in protein align- 
ments indicate that UGA most likely encodes glycine, as described 
for the oral SRI genotype by Campbell et al. (10). Interestingly, the
SRI genome also harbors duplicated and interrupted tRNA syn- 
thetases. Specifically, there is an extra, fragmented copy of both 
valine- and alanine-tRNA synthetases and an unusual isoleucine- 
tRNA synthetase that appears to be split into four regions by a 
nudix hydrolase domain and two phosphoglycerate mutase do- 
mains. The same tRNA synthetases were also found in the ACD80 
genome. 
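
The effect of the genetic code on ORF prediction described above can be illustrated with a minimal sketch. It is not the pipeline used in this study; the names make_table, translate, CODE_11, CODE_4, and UGA_GLY are hypothetical, and the only assumption beyond the text is the standard codon table, from which code 4 and the UGA-to-glycine code differ by a single codon.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard code, with codons enumerated in TCAG order
# (third base varying fastest). "*" marks a termination codon.
STANDARD_AAS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)


def make_table(overrides=None):
    """Build a codon -> amino acid table, optionally reassigning codons."""
    codons = ("".join(c) for c in product(BASES, repeat=3))
    table = dict(zip(codons, STANDARD_AAS))
    table.update(overrides or {})
    return table


CODE_11 = make_table()              # bacterial code: UGA terminates
CODE_4 = make_table({"TGA": "W"})   # code 4: UGA read as tryptophan
UGA_GLY = make_table({"TGA": "G"})  # illustrative table: UGA read as glycine


def translate(dna, table):
    """Translate from the first base, stopping at the first termination codon."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = table.get(dna[i:i + 3], "X")  # "X" for codons with ambiguous bases
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

With an in-frame UGA, translation under code 11 truncates the product, whereas code 4 and the glycine table read through and differ only in the residue inserted, which is why conserved alignment positions are needed to choose between them.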

The OD1 genome has an extremely small number of UGA stop 
codons (5% of all ORFs, compared to 22% in WWE3 and 15% in 
TM7), suggesting that this codon may be approaching extinction 
and possible reassignment, leading to alternate coding. The OD1 
genome also contains a predicted suppressor tRNA that reads the 
codon UAA. However, the majority of the predicted genes in this 
genome use UAA for termination, which suggests that this may be 
a pseudo-tRNA or recognize a different codon. 
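
Stop codon usage of the kind compared above can be tallied from predicted ORF sequences. The following is a minimal sketch, assuming each ORF is supplied as a nucleotide string that includes its terminal codon; the function name stop_codon_usage is hypothetical, and the sketch reports fractions among ORFs that end in a recognized stop codon.

```python
from collections import Counter

STOP_CODONS = ("TAA", "TAG", "TGA")


def stop_codon_usage(orfs):
    """Return the fraction of ORFs terminated by each stop codon.

    ORFs whose last three bases are not a stop codon (for example,
    partial genes at scaffold ends) are ignored.
    """
    counts = Counter()
    for seq in orfs:
        last = seq[-3:].upper().replace("U", "T")
        if last in STOP_CODONS:
            counts[last] += 1
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in STOP_CODONS}
```

A low TGA fraction relative to other genomes from the same community is the pattern reported here for OD1.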

DISCUSSION 

Metabolic predictions for all four genomes point to a primarily 
fermentation-based lifestyle and an inability to synthesize essen- 
tial metabolites. However, many predicted ORFs were annotated 
only at the protein domain level or not at all (Table 1), and some 
unannotated proteins may complete metabolic pathways that ap- 
pear to be broken in or absent from the CP genomes. As more 
sequence data from the CP become available, conserved protein 
families will likely emerge, preparing the way for further exploration
at the phylogenetic, biochemical, and structural levels.

The most remarkable feature of all four complete or essentially 
complete CP genomes is their small size. The first estimate of the 
genome size of a member of TM7, based on fragmentary single-cell
sequencing data, was substantially larger (20). Our result, a
genome size of 0.85 Mbp for RAAC3, parallels the recent finding 
of a 1.01-Mbp genome for another TM7 organism (12). Campbell 
et al. suggested that the genome of a member of SRI was <2 Mbp 
(10), somewhat larger than the 1.17-Mbp genome size of RAAC1. 
The expectation of small OP11 and OD1 genome sizes was also
noted by Wrighton et al. (8). Together, data presented here and 
recently published results suggest that small genome sizes are 
common across multiple phyla. 

Genomes of the small sizes reported here are found in some 
free-living marine bacteria and in obligate symbionts (Fig. 3), 
both of which may be descended from organisms with larger ge- 
nomes. Mechanisms for genome reduction in free-living bacteria 
include streamlining (e.g., decreased intergenic distance and loss 
of nonessential genes and pathways) or metabolic specialization, 
and in obligate symbionts, directed loss of genes whose functions 
are provided by the host (49-51). No close relatives with larger 
genomes have been sequenced to date, and small genome size may 
indeed be an ancestral trait of these CP. 

Examination of pseudogenes can sometimes reveal the evolu- 
tionary trajectory of bacteria with reduced genome sizes. Genome 
erosion and accumulation of pseudogenes are characteristic of the 
early stages of evolving symbiosis in bacteria (51), whereas an 
elimination of pseudogenes is suggestive of genomic streamlining 
(50) or later stages of symbiosis (51). However, until more closely 
related CP genomes are sequenced, it will be difficult to determine 
which unannotated genes are truly pseudogenes and which serve 
novel functions in certain lineages. The coding density of the four 
CP genomes (around 0.90; Table 1) is lower than that reported for
some organisms that have undergone genomic streamlining (e.g., 
0.97 in Prochlorococcus marinus and Pelagibacter ubique) (49, 52) 
but higher than the value for an obligate symbiont, where stream- 
lining was not a mechanism of genome reduction (0.81 in Candi- 
datus Atelocyanobacterium thalassa) (53, 54). 
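
Coding density, as compared above, is the fraction of genome positions covered by predicted genes. A minimal sketch follows, assuming 1-based inclusive gene coordinates on a single linear sequence and counting overlapping genes only once; the function name coding_density is hypothetical, and the exact convention behind the values in Table 1 may differ.

```python
def coding_density(genes, genome_length):
    """Fraction of genome positions covered by at least one gene.

    genes: iterable of (start, end) pairs, 1-based and inclusive.
    """
    covered = 0
    current_end = 0  # last position already counted as coding
    for start, end in sorted(genes):
        start = max(start, current_end + 1)  # skip bases already counted
        if end >= start:
            covered += end - start + 1
            current_end = end
    return covered / genome_length
```

For example, genes at (1, 100), (90, 200), and (301, 400) on a 500-bp sequence cover 300 positions, giving a density of 0.6.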

Genome reduction has been suggested as a driving factor in the 
switch to alternate coding in some symbiotic alphaproteobacteria 
and mitochondria (55, 56). In these groups, which use code 4, a 
single tRNA-Trp has mutated to accommodate both UGG and UGA
via wobble pairing, allowing the elimination of tRNA and termi- 
nation factor genes (55, 56). While there is currently no consistent 
bioinformatic method for defining tRNA-to-amino-acid specific- 
ity (57), protein alignments and biochemical evidence (10) are 
convincing arguments for the recoding of UGA to glycine in all of 
the SRI genomes reported thus far, and this coding in RAAC1 
supports the suggestion by Campbell et al. that this may be a 
phylum level trait (10). In contrast to the code 4 organisms men- 
tioned above, the SRI genomes appear to use not wobble pairing 
but rather an additional tRNA-Gly(UCA) to achieve alternate coding
(10). The use of UGA for glycine could be a mechanism for reduc- 
ing genomic GC content, as all other glycine codons are more GC 
rich (see Fig. S8 in the supplemental material). 

The CP genomes appear smaller than those of typical free- 
living bacteria, at least in part because of missing metabolic func- 
tions. Recently, McLean et al. noted a similarly small genome size 
for a member of TM6 (23) and suggested that it may be a symbiont.
Our analysis suggests that this may also be true for some or all
of the organisms reported here. If several core biosynthetic path- 
ways are, in fact, absent from the CP bacteria described here, the 
organisms likely rely heavily on one (possibly a member of the 
phylum Tenericutes in the case of the members of SRI) or more 
community members in a manner similar to symbiosis. Type IV 
pili, encoded by all four genomes, may aid the cells in interacting 
with the environment and with other organisms via adhesion to 
extracellular surfaces, DNA uptake, and biofilm formation (46). 
Other adhesion- or biofilm-related proteins may also be impor- 
tant to the life strategies of these organisms. Transporters, nu- 
cleases, and proteases could allow the organisms to make use of 
metabolites provided by biomass in their environment or by a 
host. A potential dependence on other organisms may explain 
why these CP bacteria remain uncultured. 

MATERIALS AND METHODS 

Field experiment. The experimental conditions used in this study and 
sample collection in the field will be described elsewhere in detail (Hand- 
ley et al., unpublished). The present study focused on acetate-amended 
sediment collected between August and November of 2010. Thirteen 
flowthrough columns packed with sieved (<2 mm) sediment were placed 
into one of three wells at the U.S. Department of Energy Integrated Field 
Research Challenge (IFRC) aquifer in Rifle, CO. Columns were equili- 
brated to conditions in the subsurface for 1 week. Subsequently, ground- 
water amended with 10 mM sodium acetate was pumped upward through 
the columns at an approximate rate of 52 ml/day. Individual columns
were sacrificed for sampling between 13 and 63 days of amendment such 
that a range of geochemical conditions from iron reduction to sulfate 
reduction was sampled. Unamended sieved sediment was taken as a back- 
ground sample. Geochemical measurements of filtered column effluent 
were made on the day of column sacrifice or on the day before. Aqueous 
ferrous iron and sulfide were quantified immediately following the collec- 
tion of effluent with 1,10-phenanthroline and methylene blue colorimet- 
ric assays (Hach Company, Loveland, CO). Sulfate was measured by ion 
chromatography (ICS-2100 [Dionex, Sunnyvale, CA] fitted with an AS-18
guard and analytical column). 

DNA extraction and sequencing. DNA was extracted from an average 
of 25 ± 7 g of acetate-amended sediment (samples 2 to 14) and 42.1 g of
sieved background sediment with PowerMax Soil DNA Isolation kits 
(MoBio Laboratories, Inc., Carlsbad, CA), with the following modifica- 
tions of the manufacturer's instructions. Sediment was vortexed for an 
additional 3 min in SDS at maximum speed and then incubated for 30 min 
at 65°C in place of bead beating. Extracted DNA was precipitated with 
cold ethanol, Na-acetate (0.3 M, pH 5.2), and glycogen (50 µg/ml) and
re-eluted in 50 µl elution buffer. Illumina sequencing was conducted at
the UC Davis DNA Technologies Core Facility
(http://dnatech.genomecenter.ucdavis.edu) by using paired-end 101-bp reads with an
insert size of 500 bp. Sequencing was distributed across five lanes to produce
between 3 and 6 Gbp of sequence for each of the 13 amended samples
and 15 Gbp for the background sample. 

Metagenomic assembly and curation. Sequence data sets for each 
sample were assembled independently by using idba_ud with default pa- 
rameters (58). This generated 61.0 to 107.1 Mbp of sequence information 
per sample and 16.8 Mbp for background sediment on scaffolds of >5 kb. 
Genes on all scaffolds >5 kb in length were predicted by using Prodigal 
with the metagenome option (59, 60). Scaffolds for which Prodigal chose 
code 4 (UGA translated as tryptophan instead of termination) were man- 
ually curated into a genome identified as belonging to SRI. The genome 
was retranslated with UGA encoding glycine (see Results). For each scaf- 
fold, we determined the GC content, coverage, genetic code, and profile of 
phylogenetic affiliation based on the best hit for each gene in Uniref90 
(61). On the basis of analyses of these data, as well as emergent self- 
organizing map (ESOM)-based analyses of tetranucleotide frequencies 
and time series relative abundance (62, 63), draft genomes were generated
that included scaffolds from multiple samples. Scaffolds for the same ge- 
nome found in different samples were aligned to yield longer fragments, 
leveraging the observation that fragmentation of assemblies is, to some 
extent, dependent on the context (community composition). Read map- 
ping used Bowtie (64), and paired-read information was used to extend 
and join contigs and to fill in gaps left by the assembler (63). A few regions, 
particularly those containing short repeats (a few hundred bases or less), 
could not be completely resolved, but the connectivity of their two sides 
was confirmed. 
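
Two of the per-scaffold statistics used for binning, GC content and tetranucleotide frequency, can be computed as in the following minimal sketch. It is not the ESOM workflow itself; the function names are hypothetical, and the sketch counts tetranucleotides on the given strand only, whereas binning tools often also pool reverse complements.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of G and C among unambiguous (A, C, G, T) bases."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def tetranucleotide_frequencies(seq):
    """Relative frequency of each of the 256 tetranucleotides.

    Overlapping windows are counted; windows containing ambiguous
    bases (e.g., N) are excluded from the total.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in kmers)
    return {k: (counts[k] / total if total else 0.0) for k in kmers}
```

The resulting 256-element vector per scaffold (or per scaffold window) is the kind of input on which a self-organizing map can cluster fragments by genome of origin.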

Functional annotation. Predicted ORFs were run through a multida- 
tabase search pipeline for functional prediction as previously described 
(8). Briefly, sequence similarity searches were performed with USEARCH 
(65) against UniRef90 (61) and the KEGG database (Kyoto University 
Bioinformatics Center). Domain level functional annotation was done 
with InterProScan (66). RNA was predicted by a combination of database 
searching and tRNAScan-SE (67) for tRNAs. Cellular localization was 
predicted with PSORTb (version 3.0) (68) and by detection of the sortase 
cleavage motif, LPXTG (69), with in-house scripts. 
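
The in-house scripts for motif detection are not described further. As an illustration only, a minimal sketch of LPXTG detection with a regular expression is given below; the function name find_sortase_motifs and the optional C-terminal window are assumptions, not details taken from this study.

```python
import re

# L-P-any residue-T-G, the canonical sortase cleavage motif.
LPXTG = re.compile(r"LP[A-Z]TG")


def find_sortase_motifs(protein, c_terminal_window=None):
    """Return (1-based position, matched motif) for each LPXTG occurrence.

    If c_terminal_window is given, only the last that many residues are
    searched, reflecting the usual C-terminal location of sorting signals.
    """
    protein = protein.upper()
    offset = 0
    if c_terminal_window is not None:
        offset = max(0, len(protein) - c_terminal_window)
    return [
        (m.start() + offset + 1, m.group())
        for m in LPXTG.finditer(protein[offset:])
    ]
```

A motif match alone is weak evidence; combining it with a predicted signal peptide and a localization prediction, as done here with PSORTb, reduces false positives.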

Phylogenetic analysis. Genomes were placed into phylogenetic con- 
text based on analysis of the 16S rRNA gene sequences. Sequences were 
aligned with the SILVA database by using the SINA alignment service 
(http://www.arb-silva.de/aligner) (70). Representative sequences from 
TM7, PER, BD1-5, SRI, OD1, WS6, WWE3, and OP11 were obtained 
from SILVA in aligned form (see the accession numbers provided in
Fig. S1). Conserved gaps were removed from compiled aligned sequences with
GapStreeze v.2.1.0 with the gap tolerance set to 99%
(http://www.hiv.lanl.gov/content/sequence/GAPSTREEZE/gap.html). The alignment was
further trimmed to remove uninformative regions. A maximum- 
likelihood tree was constructed with RAxML by using the GTRCAT 
model with 1,000 bootstraps. For WWE3, which is not recognized as a 
phylum by SILVA and is called "otu_4443" in GreenGenes (accessed June 
2013), the SINA aligner was used to find sequences with >80% identity to 
the genomic 16S rRNA gene sequence. These sequences were included on 
a 16S rRNA gene tree containing multiple CP and formed a phylum level 
branch. 
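
The gap-removal step can be sketched as follows. This is not GapStreeze itself; it assumes that a gap tolerance of 99% means that alignment columns in which more than 99% of the sequences carry a gap are removed, and the function name strip_gap_columns is hypothetical.

```python
def strip_gap_columns(alignment, gap_tolerance=0.99, gap_chars="-."):
    """Drop alignment columns whose gap fraction exceeds gap_tolerance.

    alignment: list of equal-length aligned sequence strings.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n = len(alignment)
    keep = [
        i for i in range(length)
        if sum(seq[i] in gap_chars for seq in alignment) / n <= gap_tolerance
    ]
    return ["".join(seq[i] for i in keep) for seq in alignment]
```

At a tolerance of 0.99, only columns that are gaps in essentially every sequence are dropped, which shortens a database-wide alignment without discarding informative positions.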

Overall community composition. Percent relative abundances of ge- 
nomes in each of the 13 samples and the background sediment sample 
were calculated on the basis of mapping of unpaired reads from each 
sample against each genome with Bowtie2 (version 2.0.4) with the settings 
--phred33 and --fast with default specificity (71). Community composition
by sample was determined with EMIRGE (72) by using 80 iterations
and SILVA 108 clustered at 97% identity as the reference database. Chi- 
meras were detected with Uchime (65). Reconstructed 16S rRNA gene 
sequences were analyzed with RDP-classifier
(http://rdp.cme.msu.edu/classifier/classifier.jsp), and relative abundances
were calculated by EMIRGE on the basis of read mapping normalized for sequence
length. Operational taxonomic unit abundances were summed within each
phylogenetic order represented (members of the phylum Proteobacteria were
summed at the class level). All orders with abundances of <5% in all 
samples were included in the category "other" (Fig. 3). 
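
The two calculations described above, length-normalized relative abundance and pooling of low-abundance groups, can be sketched as follows. This is an illustration under stated assumptions, not the EMIRGE or Bowtie2 output processing used here; the function names and the dictionary-based inputs are hypothetical.

```python
def relative_abundance(read_counts, lengths):
    """Percent relative abundance from mapped-read counts.

    Counts are divided by reference length so that longer sequences
    do not appear more abundant simply because they recruit more reads.
    """
    coverage = {name: read_counts[name] / lengths[name] for name in read_counts}
    total = sum(coverage.values())
    return {
        name: (100.0 * cov / total if total else 0.0)
        for name, cov in coverage.items()
    }


def collapse_rare(samples, threshold=5.0):
    """Pool taxa below the threshold in every sample into "other".

    samples: {sample name: {taxon: percent abundance}}.
    """
    taxa = {taxon for profile in samples.values() for taxon in profile}
    keep = {
        taxon for taxon in taxa
        if any(profile.get(taxon, 0.0) >= threshold for profile in samples.values())
    }
    collapsed = {}
    for name, profile in samples.items():
        row = {taxon: value for taxon, value in profile.items() if taxon in keep}
        other = sum(value for taxon, value in profile.items() if taxon not in keep)
        if other:
            row["other"] = other
        collapsed[name] = row
    return collapsed
```

A taxon is retained in every sample if it reaches the threshold in at least one, which keeps the categories consistent across the time series.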

Novel protein analysis. An original database and an updated database 
of HMMs representing sifting families (32) were obtained from
http://edhar.genomecenter.ucdavis.edu/sifting_families. These were compiled
into a searchable form with hmmpress (Janelia Farm, 2010, h3.0). Amino 
acid sequences from each genome were searched against the database by 
using hmmsearch with a reporting cutoff of 1E-5 and parsed with an 
alignment coverage threshold of 80% for both the HMM and the query 
gene. 
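
The hit-filtering criteria above (E value of at most 1E-5 and at least 80% alignment coverage of both the HMM and the query gene) can be expressed as in this minimal sketch. It does not parse hmmsearch output; the Hit record and its field names are hypothetical stand-ins for values that would be read from the search results.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One HMM-to-protein alignment; coordinates are 1-based and inclusive."""
    evalue: float
    hmm_len: int
    hmm_from: int
    hmm_to: int
    seq_len: int
    seq_from: int
    seq_to: int


def passes_filter(hit, max_evalue=1e-5, min_coverage=0.8):
    """True if the hit meets the E-value cutoff and both coverage thresholds."""
    hmm_cov = (hit.hmm_to - hit.hmm_from + 1) / hit.hmm_len
    seq_cov = (hit.seq_to - hit.seq_from + 1) / hit.seq_len
    return (
        hit.evalue <= max_evalue
        and hmm_cov >= min_coverage
        and seq_cov >= min_coverage
    )
```

Requiring coverage of both the model and the gene excludes hits that match only a single shared domain of a longer protein.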

Metabolic pathway analysis. The initial analysis relied upon gene an- 
notations from an in-house pipeline (described above) with functional 
residues confirmed in proteins of interest. Subsequently, amino acid sequences
were submitted to KAAS (http://www.genome.jp/kaas-bin/kaas_main?mode=partial)
(33) by using a customized search list of diverse
members of the domains Bacteria and Archaea (KAAS identification codes
pfa, eco, son, cje, gme, sme, rsp, mtu, bsu, cac, ctr, bfr, fjo, emi, cau, tma,
mja, afu, pho, tac, ape, sso, pai, tne, tko, pab, pfu, mma, aae, dra, det, cte, 
pma, syw, fnu, fsu, cao, sru, lil, and fra). Searches were run independently 
in both bidirectional and single-directional best-hit modes. Additional 
searches for specific genes (Fig. 4; see Table S1 in the supplemental material)
were conducted by generating a diverse reference set from 75 bacterial
and archaeal genomes in the IMG database (73) and using these as
queries for BLAST (74) to search for potential homologs within the CP 
genomes. 

Nucleotide and amino acid sequence accession numbers. All of the
sequences and annotations determined in this study can be accessed at
http://genegrabber.berkeley.edu/aac. Sequences are also available at 
NCBI through BioProject numbers PRJNA217185 
(RAAC1), PRJNA217183 (RAAC2), PRJNA217186 (RAAC3), and 
PRJNA216121 (RAAC4). 

SUPPLEMENTAL MATERIAL 

Supplemental material for this article may be found at
http://mbio.asm.org/lookup/suppl/doi:10.1128/mBio.00708-13/-/DCSupplemental.

Figure S1, EPS file, 5.7 MB.

Figure S2, EPS file, 8 MB. 

Figure S3, EPS file, 0.7 MB. 

Figure S4, EPS file, 0.7 MB. 

Figure S5, EPS file, 1.4 MB. 

Figure S6, EPS file, 1.4 MB. 

Figure S7, EPS file, 1.3 MB. 

Figure S8, EPS file, 0.8 MB. 

Table S1, PDF file, 0.1 MB.

Table S2, PDF file, 0.1 MB. 